Final: Effects of Air Pollution on Countries
1. Introduction
Motivation
Climate issues are pervasive but typically disproportionately affect low income communities and developing countries. Our group wanted to explore how air pollution has changed over time and affect countries differently. Specifically, we wanted to analyze how a country’s economic and social position can either increase, decrease, or not have observable impact on the affects of air pollution. In laymen terms, does air pollution affect underdeveloped countries disproportionately?
Set Up
Before we start, we need to ensure that we have all the relevant libraries installed and imported.
Run these in the console, or only the ones that your system does not have, to install packages in addition to the ezids package.
install.packages("tidyverse")
install.packages("rworldmap")
install.packages("tmap")
install.packages("spData")
install.packages("sf")
install.packages("ggpubr")
install.packages("dplyr")
install.packages("knitr")
install.packages("magrittr")
install.packages("factoextra")
install.packages("cluster")
install.packages("caret")
2. Data Sources and Data Wrangling
Data Sources
For our analysis, we will be working with 5 main data sources shown in the table below:
| Data | Source | Link |
|---|---|---|
| Deaths Due to Air Pollution of Countries from 1990 - 2017 | Kaggle | Link |
| GDP Annual Growth of Countries from 1960 - 2020 | Kaggle via WorldBank | Link |
| United Nations Population and Region Data | United Nations | Link |
| United Nations ISO-alpha3 code | United Nations | Link |
| spData for Map Geometries | spData for Mapping | Link |
The main variables in our datasets will include:
| Feature | Data Type | Unit of Measure | Notes and Assumptions |
|---|---|---|---|
| GDP (Gross Domestic Product) | Numerical, Continuous | $USD | This is our chosen proxy for measuring a country’s economic status |
| Population Size | Numerical, Continuous | thousands of people | Annual UN estimated |
| Deaths due to Air Pollution | Numerical, Continuous | deaths per million | This is our chosen proxy for measuring the negative affects of air pollution. |
| Country | Qualitative, Categorical | N/A | 231 countries |
| SDG Region | Qualitative, Categorical | N/A | UN’s Sustainable Development Goals Region Classification. |
| Sub Region | Qualitative, Categorical | N/A | UN’s Sustainable Development Goals Sub-Region Classification. |
| ISO-alpha3 Country Code | Qualitative, Categorical | N/A | Standard for identifying countries (text ID). |
| ISO-alpha2 Country Code | Qualitative, Categorical | N/A | Another standard for identifying countries (text ID). |
| M49 Country Code | Numerical, Categorical | N/A | Another standard for identifying countries (numerical ID). |
| Year | Numerical, Categorical | N/A | 1990 to 2017 |
| GDP per Capita | Numerical, Continuous | $USD per person | Normalization of GDP to compare between population sizes (calculated). |
Data Wrangling
While data from Kaggle are already in a format to be cleaned, downloaded data from United Nations required a little data wrangling. Mainly, we needed to extract just countries’ data from the Excel workbooks and into their own contained csv files. Since we only need to do this once and programming it would take significant time to choose the specific cells that we need, we opted to perform this step outside of R and in Excel. Note that if this were a part of a real production data pipeline, we would take the time to program the data extraction but would likely choose a different programming language such as Python that is a bit more robust in these types of tasks like web scraping and data transformations in Pandas.
- Figure 3: Sample screenshot of data downloaded from UN including unnecessary elements like banners and other regional data.
- Figure 4: Sample screenshot of transformed UN dataset.
3. Load, Clean, and Inspect Data
Load Data
| variable | class | first_values |
|---|---|---|
| Country.or.Area | character | Andorra, United Arab Emirates (the), Afghanistan, Antigua and Barbuda, Anguilla, Albania |
| ISO.alpha2.code | character | AD, AE, AF, AG, AI, AL |
| ISO.alpha3.code | character | AND, ARE, AFG, ATG, AIA, ALB |
| M49.code | integer | 20, 784, 4, 28, 660, 8 |
| variable | class | first_values |
|---|---|---|
| Entity | character | Afghanistan, Afghanistan, Afghanistan, Afghanistan, Afghanistan, Afghanistan |
| Code | character | AFG, AFG, AFG, AFG, AFG, AFG |
| Year | integer | 1990, 1991, 1992, 1993, 1994, 1995 |
| Air.pollution..total…deaths.per.100.000. | double | 299.477308883281, 291.277966734046, 278.963055615066, 278.790814746341, 287.162923177255, 288.01422374243 |
| Indoor.air.pollution..deaths.per.100.000. | double | 250.362909742375, 242.575124973334, 232.043877894811, 231.648133503794, 238.837176822107, 239.906598716878 |
| Outdoor.particulate.matter..deaths.per.100.000. | double | 46.4465894382846, 46.0338405670284, 44.2437660321924, 44.4401481443785, 45.5943284100213, 45.3671411300974 |
| Outdoor.ozone.pollution..deaths.per.100.000. | double | 5.61644203074918, 5.60396011603667, 5.61182206482564, 5.65526606275628, 5.71892222061506, 5.73917378233707 |
| variable | class | first_values |
|---|---|---|
| Country.Name | character | Aruba, Afghanistan, Angola, Albania, Andorra, Arab World |
| Country.Code | character | ABW, AFG, AGO, ALB, AND, ARB |
| Indicator.Name | character | GDP (current US\(), GDP (current US\)), GDP (current US\(), GDP (current US\)), GDP (current US\(), GDP (current US\)) |
| Indicator.Code | character | NY.GDP.MKTP.CD, NY.GDP.MKTP.CD, NY.GDP.MKTP.CD, NY.GDP.MKTP.CD, NY.GDP.MKTP.CD, NY.GDP.MKTP.CD |
| X1960 | double | NA, 537777811.111111, NA, NA, NA, NA |
| X1961 | double | NA, 548888895.555556, NA, NA, NA, NA |
| X1962 | double | NA, 546666677.777778, NA, NA, NA, NA |
| X1963 | double | NA, 751111191.111111, NA, NA, NA, NA |
| X1964 | double | NA, 800000044.444444, NA, NA, NA, NA |
| X1965 | double | NA, 1006666637.77778, NA, NA, NA, NA |
| variable | class | first_values |
|---|---|---|
| SDGRegion | character | SUB-SAHARAN AFRICA, SUB-SAHARAN AFRICA, SUB-SAHARAN AFRICA, SUB-SAHARAN AFRICA, SUB-SAHARAN AFRICA, SUB-SAHARAN AFRICA |
| SubRegion | character | Eastern Africa, Eastern Africa, Eastern Africa, Eastern Africa, Eastern Africa, Eastern Africa |
| Country | character | Burundi, Comoros, Djibouti, Eritrea, Ethiopia, Kenya |
| Notes | integer | NA, NA, NA, NA, NA, NA |
| Country.code | integer | 108, 174, 262, 232, 231, 404 |
| Type | character | Country/Area, Country/Area, Country/Area, Country/Area, Country/Area, Country/Area |
| Parent.code | integer | 910, 910, 910, 910, 910, 910 |
| X1950 | character | 2 309, 159, 62, 822, 18 128, 6 077 |
| X1951 | character | 2 360, 163, 63, 835, 18 467, 6 242 |
| X1952 | character | 2 406, 167, 65, 849, 18 820, 6 416 |
| variable | class | first_values |
|---|---|---|
| iso_a2 | character | FJ, TZ, EH, CA, US, KZ |
| name_long | character | Fiji, Tanzania, Western Sahara, Canada, United States, Kazakhstan |
| continent | character | Oceania, Africa, Africa, North America, North America, Asia |
| region_un | character | Oceania, Africa, Africa, Americas, Americas, Asia |
| subregion | character | Melanesia, Eastern Africa, Northern Africa, Northern America, Northern America, Central Asia |
| type | character | Sovereign country, Sovereign country, Indeterminate, Sovereign country, Country, Sovereign country |
| area_km2 | double | 19289.9707329765, 932745.792357074, 96270.6010408472, 10036042.9767873, 9510743.74482458, 2729810.51298781 |
| pop | double | 885806, 52234869, NA, 35535348, 318622525, 17288285 |
| lifeExp | double | 69.96, 64.163, NA, 81.9530487804878, 78.8414634146341, 71.62 |
| gdpPercap | double | 8222.25378436842, 2402.09940362843, NA, 43079.1425247165, 51921.9846391384, 23587.3375151466 |
Clean Data
First thing that we need to drop unnecessary columns and set datatypes (factor, num, etc.).
Clean air_pollution_df:
| variable | class | first_values |
|---|---|---|
| Country | integer | Afghanistan, Afghanistan, Afghanistan, Afghanistan, Afghanistan, Afghanistan |
| ISO.alpha3.code | integer | AFG, AFG, AFG, AFG, AFG, AFG |
| Year | integer | 1990, 1991, 1992, 1993, 1994, 1995 |
| Deaths.Air.Pollution.per.100k | double | 299.477308883281, 291.277966734046, 278.963055615066, 278.790814746341, 287.162923177255, 288.01422374243 |
Clean gdp_df:
| variable | class | first_values |
|---|---|---|
| Country | integer | Aruba, Aruba, Aruba, Aruba, Aruba, Aruba |
| ISO.alpha3.code | integer | ABW, ABW, ABW, ABW, ABW, ABW |
| Year | integer | 1986, 1987, 1988, 1989, 1990, 1991 |
| GDP.USD | double | 405463417.11746, 487602457.746416, 596423607.114715, 695304363.031101, 764887117.194486, 872138715.083799 |
Clean population_region_df:
| variable | class | first_values |
|---|---|---|
| SDGRegion | integer | SUB-SAHARANAFRICA, SUB-SAHARANAFRICA, SUB-SAHARANAFRICA, SUB-SAHARANAFRICA, SUB-SAHARANAFRICA, SUB-SAHARANAFRICA |
| SubRegion | integer | EasternAfrica, EasternAfrica, EasternAfrica, EasternAfrica, EasternAfrica, EasternAfrica |
| Country | integer | Burundi, Burundi, Burundi, Burundi, Burundi, Burundi |
| M49.code | integer | 108, 108, 108, 108, 108, 108 |
| Year | integer | 1950, 1951, 1952, 1953, 1954, 1955 |
| Population.thousands | double | 2309, 2360, 2406, 2449, 2492, 2537 |
Clean population_region_df:
| variable | class | first_values |
|---|---|---|
| Country.or.Area | integer | Andorra, United Arab Emirates (the), Afghanistan, Antigua and Barbuda, Anguilla, Albania |
| ISO.alpha2.code | integer | AD, AE, AF, AG, AI, AL |
| ISO.alpha3.code | integer | AND, ARE, AFG, ATG, AIA, ALB |
| M49.code | integer | 20, 784, 4, 28, 660, 8 |
Clean world:
| variable | class | first_values |
|---|---|---|
| iso_a2 | integer | FJ, TZ, EH, CA, US, KZ |
Note that we only have geometries for 175 countries, some will not be able to be plot on a map but that is okay.
Final DataFrame Construction
Now let’s merge our 4 datasets into one using a series of inner joins using country code and year as keys depending on the specific join. We are using inner joins because we want to drop all null values which would mean either a country does not have a country code or we have more years of data than our smallest year range (the air pollution dataset).
| variable | class | first_values |
|---|---|---|
| ISO.alpha2.code | integer | AD, AD, AD, AD, AD, AD |
| M49.code | integer | 20, 20, 20, 20, 20, 20 |
| Year | integer | 2012, 2013, 1990, 1991, 1992, 1993 |
| ISO.alpha3.code | integer | AND, AND, AND, AND, AND, AND |
| Country.x | integer | Andorra, Andorra, Andorra, Andorra, Andorra, Andorra |
| Deaths.Air.Pollution.per.100k | double | 17.6754871826169, 17.1893417774086, 29.0238806202567, 28.6956788863825, 28.4603211317312, 27.8408717612189 |
| GDP.USD | double | 3188808942.56713, 3193704343.20627, 1029048481.88051, 1106928582.86629, 1210013651.87713, 1007025755.00065 |
| SDGRegion | integer | EUROPE, EUROPE, EUROPE, EUROPE, EUROPE, EUROPE |
| SubRegion | integer | SouthernEurope, SouthernEurope, SouthernEurope, SouthernEurope, SouthernEurope, SouthernEurope |
| Population.thousands | double | 82, 81, 55, 57, 59, 61 |
| geom | list | list(), list(), list(), list(), list(), list() |
| gdp.per.capita | double | 38887.9139337455, 39428.4486815589, 18709.9723978275, 19419.7996994086, 20508.7059640192, 16508.6189344369 |
Our dataset is finally ready to be analyzed.
4. EDA Summary
All necessary EDA was performed in the Midterm Report where we looked at distributions for each key numerical variable in Air Pollution Induced Deaths per 100k, Population, GDP per Capita. We also looked into boxplots of these key numerical variables by Regions and Sub Regions. Interestingly, we observed that the same subregions that have low deaths caused by air pollution also have high GDP per capita comparatively.
Next, we checked for outliers and found that Deaths.Air.Pollution.per.100k definitely do not need to have outliers removed as it only represents 0.7% of the dataset. gdp.per.capita have higher percentage of classified outliers at 13.5%, however, we felt that keeping the extreme datapoints of this feature was important in our research and analysis to capture the true disproportionate distribution of wealth and progress across countries in our world.
Lastly, we explored a few choropleth maps to visually understand our dataset better.
We intentionally did not include any figures from EDA in this final report to preserve space. Please look at the Midterm Report for details on our entire EDA process.
5. Main Midterm Research Question
Below is a rehash of our main research question from the Midterm report and we felt that the findings from this model was important enough to warrant inclusion in this final report as well since some of our new SMART questions are building on the work from this analysis.
Do lower GDP countries have more deaths per 100k due to air pollution?
Is there a correlation between GDP per capita and deaths caused by pollution? Is it linear? How strong is the correlation?
Linear Fit
Let’s first look at the general fit on the overall data.
- Figure 15: Linear model (fit1) on overall data, deaths due to air pollution per 100k vs GDP per capita, 1990 to 2017.
From the plot, we observed that there is indeed a negative correlation between deaths due to air pollution per 100k and GDP per capita. However, the strength of that relationship is not particularly strong as the R2 is really low at 0.295. This means that only 29% of the variance experienced in deaths due to air pollution per 100k is caused by GDP per capita in a linear relationship.
Transformed Log Scale - Linear Fit
We then performed a log transformation and found that our linear regression fits much better.
fit2’s summary statistics are:
##
## Call:
## lm(formula = log(Deaths.Air.Pollution.per.100k) ~ log(gdp.per.capita),
## data = final_df_sf)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.099 -0.235 0.000 0.206 1.431
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.38778 0.02663 277 <0.0000000000000002 ***
## log(gdp.per.capita) -0.38952 0.00323 -121 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.369 on 5195 degrees of freedom
## Multiple R-squared: 0.737, Adjusted R-squared: 0.737
## F-statistic: 1.45e+04 on 1 and 5195 DF, p-value: <0.0000000000000002
We re-plotted the linear model and here were the results.
- Figure 16: Fitting to a log(y) = (m)(log(x)) + b curve yields much stronger relationship.
Across the board, the strength of our linear relationship increases dramatically when first transforming both features by the log() function first. The new R2 is now 0.737 which means around 74% of the variance in our target feature can be explained by this mathematical relationship.
We can then predict a country’s deaths caused from air pollution in a given year by using the country’s GDP per capita with the following equation:
\[ log(Deaths_{from~air~pollution|per~year|per~country} / 100,000) = 7.38778 - 0.38952 * log(GDP_{per capita}) ~~~~~~~~~~~~~~~~ eqn (1) \]
or solving for our target variable:
\[ Deaths_{from~air~pollution|per~year|per~country} = 10^{7.38778 - 0.38952 * log(GDP per capita)} * 100,000 ~~~~~~~~~~~~~~~~ eqn (2) \]
Is there a difference in means of death caused by pollution between low, mid, and high GDP per capita?
We all know that correlation does not necessarily mean causation. Let us dig a little deeper and test if means of deaths caused by air pollution per 100k across different GDP per capita levels are equal or not.
One-Way ANOVA Test
We started off by performing a One-Way ANOVA test to determine if the means of deaths caused by air pollution per 100k across different GDP per capita levels are equal or not.
H0: \(\mu\)deaths_lowest_gdp = \(\mu\)deaths_low_gdp = \(\mu\)deaths_medium_gdp = \(\mu\)deaths_high_gdp
H1: At least one of \(\mu\)deaths_lowest_gdp, \(\mu\)deaths_low_gdp, \(\mu\)deaths_medium_gdp, \(\mu\)deaths_high_gdp is not equal
We will use an \(\alpha\) value of 0.05.
## $`1`
## [1] 95.2 920.3
##
## $`2`
## [1] 921 3100
##
## $`3`
## [1] 3104 11493
##
## $`4`
## [1] 11497 119106
## Df Sum Sq Mean Sq F value Pr(>F)
## qnt 3 10735714 3578571 3133 <0.0000000000000002 ***
## Residuals 5193 5932162 1142
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The p-valuetest1 is 0e+00, which is lower than \(\alpha\)0.05. Therefore, we reject our null hypothesis that \(\mu\)deaths_lowest_gdp = \(\mu\)deaths_low_gdp = \(\mu\)deaths_medium_gdp = \(\mu\)deaths_high_gdp. This means that there is statistically significant that at least one of the means of deaths in low, medium, and high GDP per capita are not the same.
2-Sample T-Tests
We will conduct 6 2-sample t-tests to determine if each of the groupings are different from each other:
- Lowest GDP per capita’s deaths does not equal Low GDP per capita’s deaths
- H0: \(\mu\)deaths_lowest_gdp = \(\mu\)deaths_low_gdp
- H1: \(\mu\)deaths_lowest_gdp != \(\mu\)deaths_low_gdp
- Low GDP per capita’s deaths does not equal Medium GDP per capita’s deaths
- H0: \(\mu\)deaths_low_gdp = \(\mu\)deaths_medium_gdp
- H1: \(\mu\)deaths_low_gdp != \(\mu\)deaths_medium_gdp
- Medium GDP per capita’s deaths does not equal High GDP per capita’s deaths
- H0: \(\mu\)deaths_medium_gdp = \(\mu\)deaths_high_gdp
- H1: \(\mu\)deaths_medium_gdp != \(\mu\)deaths_high_gdp
- Lowest GDP per capita’s deaths does not equal High GDP per capita’s deaths
- H0: \(\mu\)deaths_lowest_gdp = \(\mu\)deaths_high_gdp
- H1: \(\mu\)deaths_lowest_gdp != \(\mu\)deaths_high_gdp
- Lowest GDP per capita’s deaths does not equal Medium GDP per capita’s deaths
- H0: \(\mu\)deaths_lowest_gdp = \(\mu\)deaths_medium_gdp
- H1: \(\mu\)deaths_lowest_gdp != \(\mu\)deaths_medium_gdp
- Low GDP per capita’s deaths does not equal Highest GDP per capita’s deaths
- H0: \(\mu\)deaths_low_gdp = \(\mu\)deaths_high_gdp
- H1: \(\mu\)deaths_low_gdp != \(\mu\)deaths_high_gdp
We used a two sample t-test for each and used an \(\alpha\) value of 0.05.
Test 1:
Conclusion of test1: p-valuetest1 is less than \(\alpha\)0.05, therefore we reject our null hypothesis that \(\mu\)deaths_lowest_gdp is equal to \(\mu\)deaths_low_gdp and accept our alternative hypothesis.
Test 2:
Conclusion of test2: p-valuetest2 is less than \(\alpha\)0.05, therefore we reject our null hypothesis that \(\mu\)deaths_low_gdp is equal to \(\mu\)deaths_medium_gdp and accept our alternative hypothesis.
Test 3:
Conclusion of test3: p-valuetest3 is less than \(\alpha\)0.05, therefore we reject our null hypothesis that \(\mu\)deaths_medium_gdp is equal to \(\mu\)deaths_high_gdp and accept our alternative hypothesis.
Test 4:
Conclusion of test4: p-valuetest4 is less than \(\alpha\)0.05, therefore we reject our null hypothesis that \(\mu\)deaths_lowest_gdp is equal to \(\mu\)deaths_high_gdp and accept our alternative hypothesis.
Test 5:
Conclusion of test5: p-valuetest5 is less than \(\alpha\)0.05, therefore we reject our null hypothesis that \(\mu\)deaths_lowest_gdp is equal to \(\mu\)deaths_medium_gdp and accept our alternative hypothesis.
Test 6:
Conclusion of test6: p-valuetest6 is less than \(\alpha\)0.05, therefore we reject our null hypothesis that \(\mu\)deaths_low_gdp is equal to \(\mu\)deaths_high_gdp and accept our alternative hypothesis.
Midterm Main Research Results
From all of our tests above, we can confirm that the means of deaths caused by air pollution are statistically significant when grouped by different levels of GDP per capita. This reinforces the idea that deaths caused by air pollution has a significant relationship with GDP per capita and the model can be quantified by Equation 2:
\[ Deaths_{from~air~pollution|per~year|per~country} = 10^{7.38778 - 0.38952 * log(GDP per capita)} * 100,000 ~~~~~~~~~~~~~~~~ eqn (2) \]
The strength of the correlation can be quantified by our R2 value of 0.737 from Figure 16.
6. Smart Questions for Further Modeling
1. What are the impacts of population size from GDP and Deaths due to Air Pollution globally?
- Categorizing GDP into low (0), medium(1) and high(2)
SQ6 <- final_df
library(dplyr)
groupings_pop <- SQ6 %>% mutate(population_grouping = case_when(Population.thousands >= 100001 & Population.thousands <= 1000000000 ~ 2,
Population.thousands >= 50001 & Population.thousands <= 100001 ~ 1,
Population.thousands >= 0 & Population.thousands <= 50000 ~ 0)) # end function- Building a logit model GDP as predictor on Population
pop_gdpLogit <- glm(population_grouping ~ groupings_pop$gdp.per.capita, data = groupings_pop)summary(pop_gdpLogit)##
## Call:
## glm(formula = population_grouping ~ groupings_pop$gdp.per.capita,
## data = groupings_pop)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.246 -0.185 -0.181 -0.180 1.820
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 0.180177461 0.008492388 21.22
## groupings_pop$gdp.per.capita 0.000000553 0.000000456 1.21
## Pr(>|t|)
## (Intercept) <0.0000000000000002 ***
## groupings_pop$gdp.per.capita 0.23
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.267)
##
## Null deviance: 1389.8 on 5196 degrees of freedom
## Residual deviance: 1389.4 on 5195 degrees of freedom
## AIC: 7899
##
## Number of Fisher Scoring iterations: 2
- Bulding a logit model GDP and Deaths due to Air Population as predictor on Population
pop_gdp_deathsLogit <- glm(population_grouping ~ groupings_pop$gdp.per.capita + groupings_pop$Deaths.Air.Pollution.per.100k, data = groupings_pop)summary(pop_gdp_deathsLogit)##
## Call:
## glm(formula = population_grouping ~ groupings_pop$gdp.per.capita +
## groupings_pop$Deaths.Air.Pollution.per.100k, data = groupings_pop)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.215 -0.194 -0.186 -0.169 1.842
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 0.202910224 0.018188095 11.16
## groupings_pop$gdp.per.capita 0.000000136 0.000000543 0.25
## groupings_pop$Deaths.Air.Pollution.per.100k -0.000213150 0.000150811 -1.41
## Pr(>|t|)
## (Intercept) <0.0000000000000002 ***
## groupings_pop$gdp.per.capita 0.80
## groupings_pop$Deaths.Air.Pollution.per.100k 0.16
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.267)
##
## Null deviance: 1389.8 on 5196 degrees of freedom
## Residual deviance: 1388.9 on 5194 degrees of freedom
## AIC: 7899
##
## Number of Fisher Scoring iterations: 2
2. What are the effects of (low or high) GDP and population size on deaths due to air pollution in Sub-Saharan Africa?
Here we use decision tree classification algorithm to predict which countries in Sub-Saharan Africa are more likely to have lower or higher deaths per 100,000 due to air pollution. We use predictors -GDP per capita and population size in thousands.
As we learned this semester, a qualitative response warrants the utilization of a classification tree. This helps us predict you predict that each observation belongs to the most commonly occurring class of training observations in the region to which it belongs.
Below, we subset our data to only relect Sub-Saharan countries, and then we create a new variable deaths that recategorizes existing variable Deaths.Air.Pollution.per.100k into "high" or "low" based on the median deaths due to air pollution from 1990 to 2017.
Since our median is 143. 5, we set deaths = “high” if Deaths.Air.Pollution.per.100k is greater than or equal to 143, and deaths = “low” if Deaths.Air.Pollution.per.100k is less than 143. Lastly, we transform the deaths variable into a factor variable
## [1] "ISO.alpha2.code" "M49.code"
## [3] "Year" "ISO.alpha3.code"
## [5] "Country.x" "Deaths.Air.Pollution.per.100k"
## [7] "GDP.USD" "SDGRegion"
## [9] "SubRegion" "Population.thousands"
## [11] "geom" "gdp.per.capita"
## ISO.alpha2.code M49.code Year ISO.alpha3.code Country.x
## 157 AO 24 1997 AGO Angola
## 158 AO 24 1990 AGO Angola
## 159 AO 24 1992 AGO Angola
## 160 AO 24 1998 AGO Angola
## 161 AO 24 2001 AGO Angola
## 162 AO 24 2014 AGO Angola
## Deaths.Air.Pollution.per.100k GDP.USD SDGRegion SubRegion
## 157 202.2717564 7648377413 SUB-SAHARANAFRICA MiddleAfrica
## 158 224.5086673 11228764963 SUB-SAHARANAFRICA MiddleAfrica
## 159 219.9268744 8307810974 SUB-SAHARANAFRICA MiddleAfrica
## 160 202.7319031 6506229607 SUB-SAHARANAFRICA MiddleAfrica
## 161 183.0152710 8936063723 SUB-SAHARANAFRICA MiddleAfrica
## 162 103.2917847 145712200313 SUB-SAHARANAFRICA MiddleAfrica
## Population.thousands geom gdp.per.capita
## 157 14872 MULTIPOLYGON (((12.32243167... 514.2803532
## 158 11848 MULTIPOLYGON (((12.32243167... 947.7350577
## 159 12657 MULTIPOLYGON (((12.32243167... 656.3807358
## 160 15360 MULTIPOLYGON (((12.32243167... 423.5826567
## 161 16946 MULTIPOLYGON (((12.32243167... 527.3258423
## 162 26942 MULTIPOLYGON (((12.32243167... 5408.3661314
## 'data.frame': 1239 obs. of 12 variables:
## $ ISO.alpha2.code : Factor w/ 248 levels "AD","AE","AF",..: 8 8 8 8 8 8 8 8 8 8 ...
## $ M49.code : Factor w/ 249 levels "4","8","10","12",..: 7 7 7 7 7 7 7 7 7 7 ...
## $ Year : Factor w/ 28 levels "1990","1991",..: 8 1 3 9 12 25 13 2 18 17 ...
## $ ISO.alpha3.code : Factor w/ 197 levels "","AFG","AGO",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ Country.x : Factor w/ 231 levels "Afghanistan",..: 7 7 7 7 7 7 7 7 7 7 ...
## $ Deaths.Air.Pollution.per.100k: num 202 225 220 203 183 ...
## $ GDP.USD : num 7648377413 11228764963 8307810974 6506229607 8936063723 ...
## $ SDGRegion : Factor w/ 9 levels "AUSTRALIA/NEWZEALAND",..: 9 9 9 9 9 9 9 9 9 9 ...
## $ SubRegion : Factor w/ 22 levels "AUSTRALIA/NEWZEALAND",..: 10 10 10 10 10 10 10 10 10 10 ...
## $ Population.thousands : num 14872 11848 12657 15360 16946 ...
## $ geom :sfc_MULTIPOLYGON of length 1239; first list element: List of 2
## ..$ :List of 1
## .. ..$ : num [1:66, 1:2] 12.3 12.2 12.7 12.9 13.2 ...
## ..$ :List of 1
## .. ..$ : num [1:9, 1:2] 13 12.6 12.3 11.9 12.2 ...
## ..- attr(*, "class")= chr [1:3] "XY" "MULTIPOLYGON" "sfg"
## $ gdp.per.capita : num 514 948 656 424 527 ...
## - attr(*, "na.action")= 'omit' Named int [1:56] 1240 1241 1242 1243 1244 1245 1246 1247 1248 1249 ...
## ..- attr(*, "names")= chr [1:56] "5142" "5143" "5144" "5145" ...
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 42.3263 110.8553 143.4512 141.8158 170.7379 259.5389
Train the Model
The next thing we do is create a train and test set, this is done to train our model (train set) and test the prediction (test set). As taught in class, we split the data 80/20, 80% to train the model (991 observations), 20% to make predictions (248 observations). We verified this randomization process and see that low deaths in both sets are about 49% as shown below where we show the train and test breakdown of high vs low.
##
## high low
## 0.5045408678 0.4954591322
##
## high low
## 0.5040322581 0.4959677419
Next we build and train our model as deaths ~ gdp.per.capita + Population.thousands and plot the tree. It is important to note that the rpart() function we employ uses the Gini impurity measure to split the nodes. The Gini Impurity measures which features has least likelihood of misclassification.
## Call:
## rpart(formula = deaths ~ gdp.per.capita + Population.thousands,
## data = train, method = "class", control = list(maxdepth = 4))
## n= 991
##
## CP nsplit rel error xerror xstd
## 1 0.47861507128 0 1.0000000000 1.0773930754 0.03198383265
## 2 0.04887983707 1 0.5213849287 0.5315682281 0.02823991365
## 3 0.01000000000 3 0.4236252546 0.4378818737 0.02642602072
##
## Variable importance
## gdp.per.capita Population.thousands
## 74 26
##
## Node number 1: 991 observations, complexity param=0.4786150713
## predicted class=high expected loss=0.4954591322 P(node) =1
## class counts: 500 491
## probabilities: 0.505 0.495
## left son=2 (672 obs) right son=3 (319 obs)
## Primary splits:
## gdp.per.capita < 1064.866253 to the left, improve=130.81631240, (0 missing)
## Population.thousands < 1977.5 to the right, improve= 23.65467744, (0 missing)
## Surrogate splits:
## Population.thousands < 1300.5 to the right, agree=0.763, adj=0.263, (0 split)
##
## Node number 2: 672 observations, complexity param=0.04887983707
## predicted class=high expected loss=0.318452381 P(node) =0.6781029263
## class counts: 458 214
## probabilities: 0.682 0.318
## left son=4 (235 obs) right son=5 (437 obs)
## Primary splits:
## gdp.per.capita < 347.5075869 to the left, improve=42.27676140, (0 missing)
## Population.thousands < 26171.5 to the left, improve=20.09210181, (0 missing)
##
## Node number 3: 319 observations
## predicted class=low expected loss=0.131661442 P(node) =0.3218970737
## class counts: 42 277
## probabilities: 0.132 0.868
##
## Node number 4: 235 observations
## predicted class=high expected loss=0.07659574468 P(node) =0.2371342079
## class counts: 217 18
## probabilities: 0.923 0.077
##
## Node number 5: 437 observations, complexity param=0.04887983707
## predicted class=high expected loss=0.4485125858 P(node) =0.4409687185
## class counts: 241 196
## probabilities: 0.551 0.449
## left son=10 (371 obs) right son=11 (66 obs)
## Primary splits:
## Population.thousands < 25928.5 to the left, improve=26.793946050, (0 missing)
## gdp.per.capita < 536.6554273 to the left, improve= 2.834010564, (0 missing)
##
## Node number 10: 371 observations
## predicted class=high expected loss=0.3746630728 P(node) =0.3743693239
## class counts: 232 139
## probabilities: 0.625 0.375
##
## Node number 11: 66 observations
## predicted class=low expected loss=0.1363636364 P(node) =0.06659939455
## class counts: 9 57
## probabilities: 0.136 0.864
##
## Classification tree:
## rpart(formula = deaths ~ gdp.per.capita + Population.thousands,
## data = train, method = "class", control = list(maxdepth = 4))
##
## Variables actually used in tree construction:
## [1] gdp.per.capita Population.thousands
##
## Root node error: 491/991 = 0.49545913
##
## n= 991
##
## CP nsplit rel error xerror xstd
## 1 0.478615071 0 1.00000000 1.07739308 0.031983833
## 2 0.048879837 1 0.52138493 0.53156823 0.028239914
## 3 0.010000000 3 0.42362525 0.43788187 0.026426021
As shown above, our decision tree shows at the root node that 50% of countries in Sub-Saharan Africa have high deaths due to air pollution. Then, we see that 68% of these countries have GDP per capita that is less than $1,065 with probability of high deaths due to air pollution (32%). Then we see that out of these countries, 44% has gdp per capita NOT greater than $348 with probability of high deaths due to air pollution- 24% . Lastly, we see that 37% of the countries have population less than about 26,000,000 and GDP per capita greater than $348 and less than $1,065 - with probability of high deaths due to air pollution at 37%.
#Tree Plot
post(afit, file = "afit.ps", title = "Classification Tree for Deaths due to Air Pollution in Sub-Sahara Africa")Predict
Here, we use the above test tree to predict it on our test set. We are predicting which countries are more likely to have higher deaths due to air pollution.
## high low
## 159 89
Evaluation
The next thing we do is evaluate our model’s performance using the confusion matrix, and from that, the accuracy test.
As we all know, TP: (True Positive) means predicted values correctly predicted as positive FP: (False Positive) means predicted values incorrectly predicted an actual positive FN: (False Negative) means positive values predicted as negative TN: (True Negative) means predicted values correctly predicted as an actual negative
#Evaluate
# Confusion matrix and Accuracy test
cm<-with(test, table(apred, deaths))
print(cm)## deaths
## apred high low
## high 110 49
## low 15 74
xkabledply(cm, "confusion matrix")| high | low | |
|---|---|---|
| high | 110 | 49 |
| low | 15 | 74 |
accuracy <- sum(diag(cm)) / sum(cm)
print(paste('Accuracy for test', accuracy))## [1] "Accuracy for test 0.741935483870968"
So we see above that our model accurately predicted 110 countries with high deaths due to air pollution (True negative) and 74 countries with low deaths due to air pollution (True positive), however it misclassified 49 countries for the former (False positive) and 15 countries for the latter(True negative).
The overall accuracy is 74%. Here the accuracy-test from the confusion matrix is calculated and is found to be 0.741935 Hence this model is found to predict with an accuracy of 74%.
Cross-Entropy: A third alternative, which is similar to the Gini Index, is known as the Cross-Entropy or Deviance:
Prune the tree
The last thing we want to do here is prune our tree, this will minimize overfitting the data by minimizing the cross-validated error.
#prune the tree
pafit <- prune(afit, cp = afit$cptable[2,"CP"])
printcp(pafit)##
## Classification tree:
## rpart(formula = deaths ~ gdp.per.capita + Population.thousands,
## data = train, method = "class", control = list(maxdepth = 4))
##
## Variables actually used in tree construction:
## [1] gdp.per.capita
##
## Root node error: 491/991 = 0.49545913
##
## n= 991
##
## CP nsplit rel error xerror xstd
## 1 0.478615071 0 1.00000000 1.07739308 0.031983833
## 2 0.048879837 1 0.52138493 0.53156823 0.028239914
#cp is cost complexity and has to be minimum.
# plot the pruned tree
fancyRpartPlot(pafit,main="Pruned Classification Tree for Deaths due to Air Polution")## high low
## 174 74
## [1] "Accuracy for test 0.741935483870968"
As shown above, accuracy is now 74%. Pruning had little effect even though the tree is now simpler. So, we have a model that can predict high deaths due to air pollution with 74% accuracy.
# create attractive postcript plot of tree
post(afit, file = "afit.ps", title = "Pruned Classification Tree for Deaths due to Air Pollution in Sub-Sahara Africa")How can we improve this model in the future?
Instead of of an individual tree, we can employ other techniques such as bagging, random forests, and boosting, which will significantly improve our predictive performance!
3. Can we let the data tell us what type of groupings exist in our dataset? How consistent are they to our preconceived groupings such as region or developed vs developing countries?
We wanted to better understand what types of groupings exist within our dataset. More concretely, we wanted to put aside our preconceived presumptions about our dataset and create a K-Means clustering model and allow the the data to guide us to the possible clusters that may exist within our data.
K-Means clustering is an unsupervised machine learning algorithm that is very useful to parse the dataset and identify K number of groups where observations within each group have high similarity, all without the need of labeled data.
To begin our process, let’s first create an index label to allow mapping each datapoint back to a country after clustering.
| index_ | region | Deaths.Air.Pollution.per.100k | gdp.per.capita | Population.thousands | |
|---|---|---|---|---|---|
| Australia | Australia | AUSTRALIA/NEWZEALAND | 17.76814718 | 35411.5137067 | 20282.892857 |
| New Zealand | New Zealand | AUSTRALIA/NEWZEALAND | 15.92536379 | 25278.8349742 | 4056.678571 |
| Afghanistan | Afghanistan | CENTRALANDSOUTHERNASIA | 227.76525452 | 430.2877879 | 29282.000000 |
| Bangladesh | Bangladesh | CENTRALANDSOUTHERNASIA | 144.51373473 | 619.5903166 | 133735.607143 |
| Bhutan | Bhutan | CENTRALANDSOUTHERNASIA | 115.89647264 | 1408.2427343 | 627.000000 |
| India | India | CENTRALANDSOUTHERNASIA | 165.91366818 | 838.3205492 | 1115445.428571 |
Next, let’s select only the numerical values from our dataset to use in clustering.
| variable | class | first_values |
|---|---|---|
| Deaths.Air.Pollution.per.100k | double | 17.768147178761, 15.9253637864585, 227.765254524006, 144.513734731741, 115.896472640701, 165.913668178177 |
| gdp.per.capita | double | 35411.5137067032, 25278.8349742491, 430.287787933406, 619.590316576761, 1408.24273427377, 838.320549169855 |
| Population.thousands | double | 20282.8928571429, 4056.67857142857, 29282, 133735.607142857, 627, 1115445.42857143 |
Optimal K Clusters
The key hyperparameter that we will need to optimize for is the optimal K value to achieve best results of low model complexity but still capturing relevant clusters.
The 2 methods to find optimal K is the following:
- Elbow Method with Total Within Sum of Squares
- Looking at the Gap Statistics
We will try both. Let’s begin with the first method.
Elbow Method with Total Within Sum of Squares
- Figure 6.3-2: Finding the optimal number of clusters using the Within Sum of Squares Method.
From the total Within Sum of Squares plot, we find that the optimal number of K seems to be at 4 using the Elbow method which identifies the K number where the Total Sum of Squares begins to level off, or where adding additional K clusters only improves our model marginally.
Looking at the Gap Statistics
- Figure 6.3-3: Finding the optimal number of clusters using the Gap Statistic.
From the Gap Statistic plot, we find that the optimal number of K seems to be at 10 with the highest gap statistic, however, we observed using the Elbow method that the Total Sum of Squares begins to level off from 4 to 10. We will stick with using K = 4.
Perform K-Means Clustering with Optimal K
We perform a K-Means clustering analysis with a K of 4 and here are the results below.
| Cluster | Deaths.Air.Pollution.per.100k | gdp.per.capita | Population.thousands |
|---|---|---|---|
| 1 | 1.16234711051454 | -0.61738173258503 | -0.101796957671015 |
| 2 | 1.01091758337588 | -0.564333841794921 | 9.25163623841697 |
| 3 | -1.04092828247631 | 1.92017020504228 | -0.0578237235004342 |
| 4 | -0.494528110744794 | -0.253250480051675 | -0.107965219042902 |
| Cluster | Observations with Each Cluster |
|---|---|
| 1 | 67 |
| 2 | 2 |
| 3 | 34 |
| 4 | 90 |
| Cluster | Total Sum of Squares with Each Cluster |
|---|---|
| 1 | 30.8781 |
| 2 | 1.5231 |
| 3 | 34.8756 |
| 4 | 26.9591 |
Let’s take a look at clusters visually. Interestingly, our model has grouped China and India into their own separate cluster, Cluster 2. Clusters 4, 3, and 1 can be roughly characterized by high GDP per captita nations, medium GDP per capita nations, and low GDP per capita nations’.
- Figure 3.6-7: Visualing K-Means Clustering Model.
Let’s take a closer look at each cluster’s breakdown of subregions more closely to determine what type of discrepancies we find between each cluster.
- Figure 6.3-8: Frequency % of SDGRegion by Clusters.
Interestingly, we observe that Cluster 1 is disproportionately comprised of countries in Sub-Saharan Africa with large representations from Oceania, Eastern & South-Eastern Asia, and Central & Southern Asia. Cluster 2 is the one cluster comprised solely of China and India. Cluster 3 has a much larger proportion of countries from Europe, Northen America, and Northern Africa & Western Asia. Cluster 4 is similar to Cluster 3 except it does not have countries from Northern America and has more countries from Sub-Saharan Africa and Central & Southern Asia.
4. Can we make a prediction of the future GDP by considering the population size, location, and the reation of population and deaths due to air pollution?
For our major model we make a nice prediction for GDP through total deaths caused by air pollution, so as we can see that through the distribution of the data, there are strong relationships between total deaths caused by air pollution, deaths caused by indoor air pollution, and deaths caused by outdoor air pollution.
So it seems like we can also get a nice GDP prediction by using deaths caused by indoor air pollution and deaths caused by outdoor air pollution as the basis of the prediction model.
For the first model, we build a linear model using ‘gdp.per.capita’ and ‘Deaths.Air.Pollution.Indoor.per.100k’.
##
## Call:
## lm(formula = gdp.per.capita ~ Deaths.Air.Pollution.Indoor.per.100k,
## data = final_df_sfc)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15508268 -8789754 -3652222 3500260 102785447
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 16339919.456 254680.707 64.15845
## Deaths.Air.Pollution.Indoor.per.100k -126646.729 3307.569 -38.28997
## Pr(>|t|)
## (Intercept) < 0.000000000000000222 ***
## Deaths.Air.Pollution.Indoor.per.100k < 0.000000000000000222 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13887560 on 5195 degrees of freedom
## Multiple R-squared: 0.2201014, Adjusted R-squared: 0.2199512
## F-statistic: 1466.122 on 1 and 5195 DF, p-value: < 0.00000000000000022204
- Figure 6.4-1: Deaths due to Indoor Air Pollution vs GDP per capita.
The R-squared value of this model is 0.22, which is a really bad result. So follow the steps in the major model building, now we put these two features into a log() and it should performed better.
##
## Call:
## lm(formula = log(gdp.per.capita) ~ log(Deaths.Air.Pollution.Indoor.per.100k),
## data = final_df_sfc)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.3535187 -0.5992551 0.0980527 0.6472414 2.8921185
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 16.332565271 0.017626039 926.6157
## log(Deaths.Air.Pollution.Indoor.per.100k) -0.547607424 0.005167888 -105.9635
## Pr(>|t|)
## (Intercept) < 0.000000000000000222 ***
## log(Deaths.Air.Pollution.Indoor.per.100k) < 0.000000000000000222 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8906039 on 5195 degrees of freedom
## Multiple R-squared: 0.6836804, Adjusted R-squared: 0.6836195
## F-statistic: 11228.26 on 1 and 5195 DF, p-value: < 0.00000000000000022204
- Figure 6.4-2: Log Deaths due to Indoor Air Pollution vs Log GDP per capita.
As we can see, although it performed much better with a R-squared value 0.68 but this is still not good as the major model.
Next we try the deaths caused by outdoor air pollution ‘Deaths.Air.Pollution.Outdoor.Total.per.100k’.
##
## Call:
## lm(formula = gdp.per.capita ~ Deaths.Air.Pollution.Outdoor.Total.per.100k,
## data = final_df_sfc)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14051236 -9537840 -5335898 2518144 106063637
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 15643215.32 461879.67 33.86859
## Deaths.Air.Pollution.Outdoor.Total.per.100k -150382.23 10830.39 -13.88521
## Pr(>|t|)
## (Intercept) < 0.000000000000000222 ***
## Deaths.Air.Pollution.Outdoor.Total.per.100k < 0.000000000000000222 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15441650 on 5195 degrees of freedom
## Multiple R-squared: 0.0357844, Adjusted R-squared: 0.03559879
## F-statistic: 192.7991 on 1 and 5195 DF, p-value: < 0.00000000000000022204
- Figure 6.4-3: Deaths due to Outdoor Air Pollution vs GDP per capita.
The performence of this model is also bad, with a R-squared value 0.036, so we put the features in the log() too.
##
## Call:
## lm(formula = log(gdp.per.capita) ~ log(Deaths.Air.Pollution.Outdoor.Total.per.100k),
## data = final_df_sfc)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.5907203 -1.2717139 -0.0273497 1.2970215 3.4713403
##
## Coefficients:
## Estimate Std. Error
## (Intercept) 15.69882895 0.15654605
## log(Deaths.Air.Pollution.Outdoor.Total.per.100k) -0.19907742 0.04418098
## t value
## (Intercept) 100.28250
## log(Deaths.Air.Pollution.Outdoor.Total.per.100k) -4.50595
## Pr(>|t|)
## (Intercept) < 0.000000000000000222 ***
## log(Deaths.Air.Pollution.Outdoor.Total.per.100k) 0.0000067525 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.580427 on 5195 degrees of freedom
## Multiple R-squared: 0.003893084, Adjusted R-squared: 0.00370134
## F-statistic: 20.30361 on 1 and 5195 DF, p-value: 0.000006752534
- Figure 6.4-4: Log Deaths due to Outdoor Air Pollution vs Log GDP per capita.
A surprised point is that although we put the variables into log(), the performance of model is still bad and even worse. The R-squared value of this model is 0.0039.
Finally, we are going to test build a model base on both deaths caused by outdoor and indoor air pollution. To check if this model also works bad.
##
## Call:
## lm(formula = gdp.per.capita ~ Deaths.Air.Pollution.Indoor.per.100k +
## Deaths.Air.Pollution.Outdoor.Total.per.100k, data = final_df_sfc)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18198746 -7911990 -2886359 3411956 96919910
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 26403084.065 457576.540 57.70201
## Deaths.Air.Pollution.Indoor.per.100k -144488.737 3189.786 -45.29732
## Deaths.Air.Pollution.Outdoor.Total.per.100k -242554.830 9393.524 -25.82149
## Pr(>|t|)
## (Intercept) < 0.000000000000000222 ***
## Deaths.Air.Pollution.Indoor.per.100k < 0.000000000000000222 ***
## Deaths.Air.Pollution.Outdoor.Total.per.100k < 0.000000000000000222 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13075010 on 5194 degrees of freedom
## Multiple R-squared: 0.3088267, Adjusted R-squared: 0.3085606
## F-statistic: 1160.379 on 2 and 5194 DF, p-value: < 0.00000000000000022204
- Figure 6.4-5: Deaths due to Indoor + Outdoor Air Pollution vs GDP per capita.
The R-squared value of this model is 0.29. Next we put these variables into log().
##
## Call:
## lm(formula = log(gdp.per.capita) ~ log(Deaths.Air.Pollution.Outdoor.Total.per.100k) +
## log(Deaths.Air.Pollution.Indoor.per.100k), data = final_df_sfc)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.1996055 -0.5942513 0.0554719 0.6207513 2.9007194
##
## Coefficients:
## Estimate Std. Error
## (Intercept) 17.606051113 0.088198404
## log(Deaths.Air.Pollution.Outdoor.Total.per.100k) -0.359853980 0.024440072
## log(Deaths.Air.Pollution.Indoor.per.100k) -0.552122333 0.005073061
## t value
## (Intercept) 199.61870
## log(Deaths.Air.Pollution.Outdoor.Total.per.100k) -14.72393
## log(Deaths.Air.Pollution.Indoor.per.100k) -108.83415
## Pr(>|t|)
## (Intercept) < 0.000000000000000222 ***
## log(Deaths.Air.Pollution.Outdoor.Total.per.100k) < 0.000000000000000222 ***
## log(Deaths.Air.Pollution.Indoor.per.100k) < 0.000000000000000222 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8726636 on 5194 degrees of freedom
## Multiple R-squared: 0.6963544, Adjusted R-squared: 0.6962374
## F-statistic: 5955.733 on 2 and 5194 DF, p-value: < 0.00000000000000022204
- Figure 6.4-6: Log Deaths due to Indoor + Outdoor Air Pollution vs Log GDP per capita.
For this model the R-squared value is 0.69. This is actually not a bad result but when comparing to the major model using total deaths caused by air pollution, 0.69 is not good enough for predicting GDP.
We can get the conclusion that we can not get a nice GDP prediction by using deaths caused by indoor air pollution and deaths caused by outdoor air pollution as the basis of the prediction model.
7. Conclusion of SMART Modeling Questions
For the Logit Regression, we observed small p-values indicating that all the coefficients are found to be significant. GDP has a positive effect on population, but deaths due to air pollution have a negative effect on population.
As shown above, post-pruning, our accuracy is still 74%. So, pruning had little effect even though the tree is now much simpler. Therefore, we have a model that can predict high deaths due to air pollution with 74% accuracy.
For the clustering, it is really interesting to observe that the main clusters 1, 3, and 4 have large discrepancies in regional representation while the data we used for clustering did not have any explicit geospatial components. This can mean that regionality, although may not be the cause, does have some correlation in determining the groups of factors in GDP per Capita, Population, and Deaths due to Air Pollution per 100k.
Although death caused by indoor air pollution and outdoor air pollution seems to have a strong relationship with total death caused by air pollution, but we can’t accurately predict GDP through death caused by indoor and outdoor air pollution.
8. Bibliography
| Number | APA Citation |
|---|---|
| 1 | Robin Lovelace, J. N. (n.d.). Chapter 8 Making maps with R: Geocomputation with R. Retrieved October 28, 2021, from https://geocompr.robinlovelace.net/adv-map.html |
| 2 | Robin Lovelace, J. N. (2021, October 28). Chapter 2 Geographic data in R: Geocomputation with R. Retrieved from https://geocompr.robinlovelace.net/spatial-class.html#intro-sf |
| 3 | Hadley Wickham, D. N. (2021, October 28). 6 Maps. Retrieved from https://ggplot2-book.org/maps.html |
| 4 | Customizing ggplot2 color and fill scales. (2021, October 28). Retrieved from https://spielmanlab.github.io/introverse/articles/color_fill_scales.html |
| 5 | Logarithmic Functions. (2021, October 28). Retrieved from https://saylordotorg.github.io/text_intermediate-algebra/s10-03-logarithmic-functions-and-thei.html |
| 6 | K-Means Clustering in R. (2021, December 3). Retrieved from https://www.statology.org/k-means-clustering-in-r/ |
| 7 | Confusion Matrix in Machine Learning. (2021, December 3). Retrieved from https://www.guru99.com/confusion-matrix-machine-learning-example.html |